September 2020

Welcome

Who are you?

  • What is your subject/ your profession?
  • How did you get to know R?
  • What is your motivation to learn more?

Organisation of the course

  • up to 2 hours of online class per day, from Monday to Thursday
  • homework, due the next day
  • 1 hour for questions in the afternoon or evening via video call, e-mail or Stud.IP
  • if you want, you can send me your homework by 6 pm
  • project after the course (small: approx. 5 hours, large: approx. 35 hours)

Introduction

Short history of R

  • created in 1992
  • extends the S programming language
  • S was commercial, R is free software under the GNU General Public License (GPL)
  • implemented by Ross Ihaka and Robert Gentleman
  • since 1997 supported by the R Core Group
  • Aim: A programming language by Statisticians for Statisticians

The S Philosophy

“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”

— John Chambers, Stages in the Evolution of S

Taxonomy of Programming Languages

Low level:

  • Machine Code
  • Assembly

High level(s):

  • 3rd generation, e.g. C, C++, Java
  • 4th generation, e.g. R, Python, MATLAB
  • 5th generation/ artificial languages

Characteristics of R

  • it is considered object oriented
  • but take care, object orientation in R is nasty!
  • imperative (you implement the "how" not the "what")
  • functional (you put instructions in functions, you can pass functions to functions)
  • dynamically typed (the type of an object is determined at run time, so you do not know it before running your code)
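
Two of these characteristics can be seen in a few lines. The sketch below is only an illustration: apply_twice is a made-up helper name, not a function that ships with R.

```r
# A variable can hold values of different types over its lifetime
x <- 42
class(x)        # "numeric"
x <- "forty-two"
class(x)        # "character"

# Functions are ordinary objects: they can be passed to other functions
apply_twice <- function(f, value) {   # hypothetical helper, for illustration only
  f(f(value))
}
apply_twice(sqrt, 16)   # sqrt(sqrt(16)) = 2
```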

Dos and Don'ts

R is great for

  • statistical modelling
  • mathematical calculations, vector arithmetic, …
  • data visualisations
  • websites
  • reports

You might want to choose another programming language for

  • native applications
  • extensive calculations
  • agent-based models

RStudio Programming Environment and Workflow

  • live presentation, see course notes for explanations

Two questions

  1. If you type code directly in the console, will the objects you create appear in the Environment pane? Can you imagine any problems occurring because of this?

  2. When I want to close R, there is this pop-up window asking 'Save workspace image to ~/Documents/informatica/R/.RData?'. What does this mean? Is this useful?

Answers

  1. Yes. Every object you create, whether from the console or from a script, ends up in the environment. You can click on the small broom symbol to delete all objects. A problem occurs when you change an object in the console but do not record that change in your script: the script alone then no longer reproduces your results.

  2. When you save the workspace, all objects in your environment are retained. The workspace is saved to a hidden file called .RData in your current working directory / project directory. Advantages of saving the workspace: you start at the same point, with the same objects you had when you last closed your session, and you do not need to recompute your objects, which can be time-intensive. Disadvantages: your workspace is not in a clean state when you start. You write code so that you are able to reproduce your calculations, so there is no need to store objects. There are other ways to store objects that take a long time to compute (see rds files).
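
The rds files mentioned in the answer can be sketched as follows. This is a minimal example with a stand-in object and a temporary file; in a real project you would choose your own file name.

```r
# Stand-in for a result that takes a long time to compute
slow.result <- cumsum(1:1000)

# Store this single object in an rds file (a temporary file is used here)
rds.file <- tempfile(fileext = ".rds")
saveRDS(slow.result, rds.file)

# In a later session: load it again, under any name you like
restored <- readRDS(rds.file)
identical(slow.result, restored)
```

Unlike .RData, an rds file holds exactly one object, and you decide in your script when it is read.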

R Data types

  • Character
  • Numeric, integer
  • Logical
  • Factors
  • Vectors, matrices
  • Data frames

Character

firstname <- "Diren"
surname <- "Senger"
name <- paste(firstname, surname) # Concatenation
name
## [1] "Diren Senger"
class(name) # Check the type
## [1] "character"
age <- as.character(25) # Casting
class(age)
## [1] "character"

Numeric

a <- 1
b <- 2
c <- a + b
c
## [1] 3
d <- b*c/a
d
## [1] 6
e <- d^2
e
## [1] 36

Integer

f <- as.integer(3.14)
f
## [1] 3

Logical

a <- TRUE
b <- FALSE

# or
a|b
## [1] TRUE
# and
a&b
## [1] FALSE
# not
!a
## [1] FALSE

Logical

4 == 5
## [1] FALSE
4 < 5
## [1] TRUE
4 >= 4
## [1] TRUE
somevariable <- NULL
is.null(somevariable)
## [1] TRUE

Vectors

# a simple numeric vector
myvector <- c(1,4,6)
myvector
## [1] 1 4 6
# a vector of type character
myfamily <- c("Paula", "Ali", "Emma")
myfamily
## [1] "Paula" "Ali"   "Emma"
# get the first element of a vector
myfamily[1]
## [1] "Paula"

Vectors

# apply function to all items:
myvector + 1
## [1] 2 5 7
paste(myfamily, "Meier")
## [1] "Paula Meier" "Ali Meier"   "Emma Meier"
# concatenate vectors:
longervector <- c(myvector, 5:7)
longervector
## [1] 1 4 6 5 6 7

Vectors

# create a sequence
odd <- seq(from = 1, to = 10, by = 2)
odd
## [1] 1 3 5 7 9
# create a boring sequence
boring <- rep(1, 10)
boring
##  [1] 1 1 1 1 1 1 1 1 1 1

Factors

fac <- factor(c("good", "better", "best"))
levels(fac)
## [1] "best"   "better" "good"
nlevels(fac)
## [1] 3
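
Note that the levels above come out in alphabetical order, not in the order in which they were typed. A small sketch of how the order can be set by hand, using the levels argument of factor():

```r
quality <- c("good", "better", "best")

# Default: levels are sorted alphabetically
levels(factor(quality))

# Explicit levels keep the intended order
fac.ordered <- factor(quality, levels = c("good", "better", "best"))
levels(fac.ordered)
as.integer(fac.ordered)  # underlying integer codes
```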

Named List

result <- list(data="fancy test data", mean=5, sd=1.32)
result
## $data
## [1] "fancy test data"
## 
## $mean
## [1] 5
## 
## $sd
## [1] 1.32

List of lists

and annoying double brackets

myList <- list(smallLetters=letters[1:10], 
               capitalLetters=LETTERS[1:10], 
               numbers=1:5)
myList
## $smallLetters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
## 
## $capitalLetters
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
## 
## $numbers
## [1] 1 2 3 4 5

List of lists

and annoying double brackets

# Accessing multiple elements
myList[1:2]
## $smallLetters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
## 
## $capitalLetters
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

List of lists

and annoying double brackets

# Accessing single elements
myList[2][2]
## $<NA>
## NULL
myList[[2]][2]
## [1] "B"
myList[[1:2]]
## [1] "b"
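
Why the three calls behave so differently: single brackets always return a list, double brackets return the element itself. myList[2] is therefore a list of length one, and asking for its second entry yields an empty, unnamed slot. The following sketch repeats the list from the slides so that it runs on its own:

```r
myList <- list(smallLetters = letters[1:10],
               capitalLetters = LETTERS[1:10],
               numbers = 1:5)

# Single brackets return a (shorter) list
class(myList[2])            # "list"

# Double brackets (or $) return the element itself
class(myList[[2]])          # "character"
myList$capitalLetters[2]    # "B"

# A vector inside [[ ]] indexes recursively: element 1, then its 2nd entry
identical(myList[[1:2]], myList[[1]][[2]])
```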

Matrices

m <- matrix(data=1:12, nrow = 3, ncol = 4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Matrices - Element wise operations

m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
# Element-wise operations
m * 0.5
##      [,1] [,2] [,3] [,4]
## [1,]  0.5  2.0  3.5  5.0
## [2,]  1.0  2.5  4.0  5.5
## [3,]  1.5  3.0  4.5  6.0

Matrices - Element wise operations

m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
# Element-wise operations
m * c(1,2,3)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    4   10   16   22
## [3,]    9   18   27   36
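
In the result above each row is multiplied by one element of the vector, because R recycles the vector down the columns of the matrix. A short sketch of this, plus one possible way to scale the columns instead (sweep() is a base R function; the vector w is made up for the example):

```r
m <- matrix(data = 1:12, nrow = 3, ncol = 4)

# The vector is recycled down the columns, so row i is multiplied by v[i]
v <- c(1, 2, 3)
by.row <- m * v

# To scale each column instead, sweep() makes the direction explicit
w <- c(1, 2, 3, 4)
by.col <- sweep(m, 2, w, "*")   # MARGIN = 2: one factor per column
by.col
```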

Matrices - matrix/ vector multiplication

m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
# Matrix multiplication
m %*% c(1,2,3,4)
##      [,1]
## [1,]   70
## [2,]   80
## [3,]   90

Good explanation of Matrix multiplication: Eli Bendersky's website

Data frames

# Creating a data frame
myfamily <- c("Paula", "Ali", "Emma")
family.frame <- data.frame(index = 1:3, 
                           firstname = myfamily, 
                           surname = rep("Meier", 3))
index firstname surname
1 Paula Meier
2 Ali Meier
3 Emma Meier

Reading the data

Data from bee hives

read.table

df <- read.table("beedatasmall.csv")
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 2 elements

read.table, second try

df <- read.table("beedatasmall.csv", sep = ",")
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
time hive h t_i_1 t_i_2 t_i_3 t_i_4 t_i_5 t_o weight_kg
1.565081418e+18 4 NA 27.25 26.375 23.4375 23.5 23.25 21.625 NA
1.565081428e+18 4 NA 27.25 26.4375 23.5 23.5 23.25 21.625 NA
1.565081431e+18 13 59.0554 22.875 22.1875 23 21.8125 22.6875 22.875 10.779
1.565081438e+18 4 NA 27.25 26.375 23.5 23.5625 23.25 21.625 NA

read.table, third try

df <- read.table("beedatasmall.csv", sep = ",", header = TRUE)
time hive h t_i_1 t_i_2 t_i_3 t_i_4 t_i_5 t_o weight_kg
1.565081e+18 4 NA 27.250 26.3750 23.4375 23.5000 23.2500 21.625 NA
1.565081e+18 4 NA 27.250 26.4375 23.5000 23.5000 23.2500 21.625 NA
1.565081e+18 13 59.0554 22.875 22.1875 23.0000 21.8125 22.6875 22.875 10.779
1.565081e+18 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA
1.565081e+18 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA

read.csv

df <- read.csv("beedatasmall.csv")
time hive h t_i_1 t_i_2 t_i_3 t_i_4 t_i_5 t_o weight_kg
1.565081e+18 4 NA 27.250 26.3750 23.4375 23.5000 23.2500 21.625 NA
1.565081e+18 4 NA 27.250 26.4375 23.5000 23.5000 23.2500 21.625 NA
1.565081e+18 13 59.0554 22.875 22.1875 23.0000 21.8125 22.6875 22.875 10.779
1.565081e+18 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA
1.565081e+18 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA

common mistake with read.csv

df <- read.csv("sampleData.csv")
name..age..subject..year
Li; 18; chemistry; 1
Svenson Jacob; 25; psychology; 12
Raphaela; 23; psychology; 1

common mistake with read.csv

df <- read.csv("sampleData.csv", sep=";")
name age subject year
Li 18 chemistry 1
Svenson,Jacob 25 psychology 12
Raphaela 23 psychology 1
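
The effect of the separator can be reproduced without a file. A minimal sketch, assuming a semicolon-separated text similar to sampleData.csv (the exact file content is not shown on the slides, so the rows below are only an approximation):

```r
# Approximation of a semicolon-separated file; the real sampleData.csv may differ
csv.text <- "name;age;subject;year
Li;18;chemistry;1
Raphaela;23;psychology;1"

# read.csv() expects commas by default, so everything ends up in one column
wrong <- read.csv(text = csv.text)
ncol(wrong)

# Naming the separator gives the four expected columns
right <- read.csv(text = csv.text, sep = ";")
names(right)
```

read.csv2() is a shortcut for semicolon-separated files, but it also treats the comma as the decimal mark.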

Transforming entire columns

df <- read.csv("beedata.csv")
str(df)
## 'data.frame':    171299 obs. of  10 variables:
##  $ time     : num  1.57e+18 1.57e+18 1.57e+18 1.57e+18 1.57e+18 ...
##  $ hive     : int  4 4 13 4 4 4 4 4 4 4 ...
##  $ h        : num  NA NA 59.1 NA NA ...
##  $ t_i_1    : num  27.2 27.2 22.9 27.2 27.2 ...
##  $ t_i_2    : num  26.4 26.4 22.2 26.4 26.4 ...
##  $ t_i_3    : num  23.4 23.5 23 23.5 23.5 ...
##  $ t_i_4    : num  23.5 23.5 21.8 23.6 23.6 ...
##  $ t_i_5    : num  23.2 23.2 22.7 23.2 23.2 ...
##  $ t_o      : num  21.6 21.6 22.9 21.6 21.6 ...
##  $ weight_kg: num  NA NA 10.8 NA NA ...

Transforming entire columns

E.g. converting kg to g:

df$weight_g <- df$weight_kg*1000
time hive h t_i_1 weight_kg weight_g
1.565081e+18 4 NA 27.250 NA NA
1.565081e+18 4 NA 27.250 NA NA
1.565081e+18 13 59.0554 22.875 10.779 10779
1.565081e+18 4 NA 27.250 NA NA
1.565081e+18 4 NA 27.250 NA NA

Transforming entire columns

Marking high temperature values:

df$highTemp <- df$t_i_1>25
time hive h t_i_1 weight_kg weight_g highTemp
1.565081e+18 4 NA 27.250 NA NA TRUE
1.565081e+18 4 NA 27.250 NA NA TRUE
1.565081e+18 13 59.0554 22.875 10.779 10779 FALSE
1.565081e+18 4 NA 27.250 NA NA TRUE
1.565081e+18 4 NA 27.250 NA NA TRUE

Transforming entire columns

Nanoseconds to seconds:

df$timeSeconds <- df$time/1000000000
time hive h t_i_1 weight_kg weight_g highTemp timeSeconds
1.565081e+18 4 NA 27.250 NA NA TRUE 1565081418
1.565081e+18 4 NA 27.250 NA NA TRUE 1565081428
1.565081e+18 13 59.0554 22.875 10.779 10779 FALSE 1565081431
1.565081e+18 4 NA 27.250 NA NA TRUE 1565081438
1.565081e+18 4 NA 27.250 NA NA TRUE 1565081448

Transforming entire columns

Dealing with the timestamp

df$time <- as.POSIXct(df$timeSeconds, origin="1970-01-01")
time hive h t_i_1 weight_kg weight_g highTemp timeSeconds
2019-08-06 10:50:18 4 NA 27.250 NA NA TRUE 1565081418
2019-08-06 10:50:28 4 NA 27.250 NA NA TRUE 1565081428
2019-08-06 10:50:31 13 59.0554 22.875 10.779 10779 FALSE 1565081431
2019-08-06 10:50:38 4 NA 27.250 NA NA TRUE 1565081438
2019-08-06 10:50:48 4 NA 27.250 NA NA TRUE 1565081448
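
The printed clock time depends on the time zone of the computer. The sketch below converts the first timestamp of the bee data with an explicit time zone. Europe/Berlin is an assumption here: it is a UTC+2 zone in August and would match the output shown on the slides.

```r
# Nanoseconds since 1970-01-01, as in the first row of the bee data
time.ns <- 1.565081418e+18
time.s <- time.ns / 1e9

# Without tz, R uses the time zone of your computer; set it explicitly to be sure
utc <- as.POSIXct(time.s, origin = "1970-01-01", tz = "UTC")
format(utc, "%Y-%m-%d %H:%M:%S")   # "2019-08-06 08:50:18"

# The same instant shown in an assumed local time zone
format(utc, "%Y-%m-%d %H:%M:%S", tz = "Europe/Berlin")
```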

Subsetting data

df[choose_rows, choose_columns]

with some boolean expressions
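
A small sketch of this pattern on an invented data frame (the values are made up and only mimic the bee data). It also shows what happens when the boolean expression is NA for some rows:

```r
# Small made-up data frame in the style of the bee data (values are invented)
bees <- data.frame(hive   = c(4, 4, 13, 4, 13),
                   t_o    = c(21.6, 35.2, 22.9, NA, 19.4),
                   weight = c(NA, NA, 10.8, NA, 10.9))

# Rows: a logical vector with one TRUE/FALSE per row; columns: names or numbers
hive4 <- bees[bees$hive == 4, ]
nrow(hive4)                                  # 3

# Combine conditions with & (and) and | (or)
warm4 <- bees[bees$hive == 4 & bees$t_o > 30, c("hive", "t_o")]

# Rows where the condition is NA come back as all-NA rows; which() drops them
warm4.clean <- bees[which(bees$hive == 4 & bees$t_o > 30), c("hive", "t_o")]
nrow(warm4)        # 2 (one real row plus one NA row)
nrow(warm4.clean)  # 1
```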

Subsetting data

Select and plot data from hive 4:

df.4 <- df[df$hive==4,]
plot(df.4$time, df.4$t_o, ylim=c(0,40))

Subsetting data

Select only the first 1000 lines:

df.4.sub <- df.4[1:1000,]
plot(df.4.sub$time, df.4.sub$t_o, ylim=c(0,40))

Subsetting data

Delete columns: / Choose only some columns

names(df)
##  [1] "time"        "hive"        "h"           "t_i_1"       "t_i_2"      
##  [6] "t_i_3"       "t_i_4"       "t_i_5"       "t_o"         "weight_kg"  
## [11] "weight_g"    "highTemp"    "timeSeconds"
df.some <- df[, c(1,2,10)]
time hive weight_kg
2019-08-06 10:50:18 4 NA
2019-08-06 10:50:28 4 NA
2019-08-06 10:50:31 13 10.779
2019-08-06 10:50:38 4 NA
2019-08-06 10:50:48 4 NA

Subsetting data

You can also use:

df.same <- df[, c("time", "hive", "weight_kg")]
time hive weight_kg
2019-08-06 10:50:18 4 NA
2019-08-06 10:50:28 4 NA
2019-08-06 10:50:31 13 10.779
2019-08-06 10:50:38 4 NA
2019-08-06 10:50:48 4 NA

Subsetting data

You can also use:

df.or <- df[, -c(3:9,11:13)]
time hive weight_kg
2019-08-06 10:50:18 4 NA
2019-08-06 10:50:28 4 NA
2019-08-06 10:50:31 13 10.779
2019-08-06 10:50:38 4 NA
2019-08-06 10:50:48 4 NA

Useful functions

  • summary
  • head
  • tail
  • sum
  • mean
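
A quick tour of these functions on an invented vector with one missing value. The point to remember is that sum and mean return NA as soon as one value is missing, unless you pass na.rm = TRUE:

```r
# Invented example values with one missing measurement
weight_kg <- c(10.5, NA, 11.0, 10.9, 11.2)

summary(weight_kg)            # quartiles, mean and number of NAs
head(weight_kg, 2)            # first two values
tail(weight_kg, 2)            # last two values

mean(weight_kg)               # NA, because one value is missing
mean(weight_kg, na.rm = TRUE) # mean of the remaining values
sum(is.na(weight_kg))         # number of missing values: 1
```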

Tuesday

Conditions - If and Else

If

a <- 19
if(a >= 18){
  print("Yes, you are allowed to see this content.")
}
## [1] "Yes, you are allowed to see this content."

If and else

goodWeather <- TRUE
if (goodWeather){
  print("Let's go to the beach!")
} else{
  print("Let's eat pizza.")
}
## [1] "Let's go to the beach!"

Else-if

do.you.want.this <- "chocolate"
#do.you.want.this <- "cookies"
#do.you.want.this <- "carrots"
if (do.you.want.this == "chocolate"){
  print("Yes, I love chocolate!")
} else if (do.you.want.this == "cookies"){
  print("Yes, if they are with chocolate.")
} else {
  print("Hm, not sure. Do you have chocolate?")
}
## [1] "Yes, I love chocolate!"

Else-if

library(lubridate)
birthday <- as.POSIXct("2019-08-06 10:50:18")
what.do.you.want.to.know <- "year"
#what.do.you.want.to.know <- "month"
#what.do.you.want.to.know <- "chocolate"
if (what.do.you.want.to.know == "year"){
  print(year(birthday))
} else if (what.do.you.want.to.know == "month"){
  print(month(birthday))
} else {
  print("Sorry, what do you want to know?")
}
## [1] 2019

Rain or no rain

raincoat <- TRUE
rain <- TRUE
if(raincoat){
  if(rain){
    print("dry")
  } else{
    print("useless raincoat")
  }
} else{
  if(rain){
    print("wet")
  } else{
    print("dry")
  }
}
## [1] "dry"

For

myfamily <- list(Paula=31, Ali=29, Emma=5)
age <- 0
for (i in 1:length(myfamily)){
  age <- age + myfamily[[i]]
}
age
## [1] 65
ages <- c(31, 29, 5)
total.age <- 0
for(i in ages){
  total.age <- total.age + i
}
total.age
## [1] 65

For - columns

for (i in 1:ncol(df)){
  name <- names(df)[i]
  colclass <- class(df[,i])
  print(paste(name, ":", colclass))
}
## [1] "time : POSIXct" "time : POSIXt" 
## [1] "hive : integer"
## [1] "h : numeric"
## [1] "t_i_1 : numeric"
## [1] "t_i_2 : numeric"
## [1] "t_i_3 : numeric"
## [1] "t_i_4 : numeric"
## [1] "t_i_5 : numeric"
## [1] "t_o : numeric"
## [1] "weight_kg : numeric"
## [1] "weight_g : numeric"
## [1] "highTemp : logical"
## [1] "timeSeconds : numeric"

For - transform long code …

df <- read.csv("beedata.csv")
print("mean of h:")
mean(df$h)
print("mean of t_i_1:")
mean(df$t_i_1)
print("mean of weight_kg:")
mean(df$weight_kg)
print("missing h values:")
sum(is.na(df$h))
print("missing t_i_1 values:")
sum(is.na(df$t_i_1))
print("missing weight_kg values:")
sum(is.na(df$weight_kg))

into shorter code

to.analyse <- c("h", "t_i_1", "weight_kg")
for(i in 1:length(to.analyse)){
  print(paste("mean of", to.analyse[i], ":"))
  print(mean(df[,to.analyse[i]]))
  print(paste("missing", to.analyse[i], "values:"))
  print(sum(is.na(df[,to.analyse[i]])))
}

For - diagram

The apply family

apply(df, 2, class)
##        time        hive           h       t_i_1       t_i_2       t_i_3 
## "character" "character" "character" "character" "character" "character" 
##       t_i_4       t_i_5         t_o   weight_kg    weight_g    highTemp 
## "character" "character" "character" "character" "character" "character" 
## timeSeconds 
## "character"

The apply family

lapply(df, class)
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $hive
## [1] "integer"
## 
## $h
## [1] "numeric"
## 
## $t_i_1
## [1] "numeric"
## 
## $t_i_2
## [1] "numeric"
## 
## $t_i_3
## [1] "numeric"
## 
## $t_i_4
## [1] "numeric"
## 
## $t_i_5
## [1] "numeric"
## 
## $t_o
## [1] "numeric"
## 
## $weight_kg
## [1] "numeric"
## 
## $weight_g
## [1] "numeric"
## 
## $highTemp
## [1] "logical"
## 
## $timeSeconds
## [1] "numeric"

The apply family

sapply(df, class)
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $hive
## [1] "integer"
## 
## $h
## [1] "numeric"
## 
## $t_i_1
## [1] "numeric"
## 
## $t_i_2
## [1] "numeric"
## 
## $t_i_3
## [1] "numeric"
## 
## $t_i_4
## [1] "numeric"
## 
## $t_i_5
## [1] "numeric"
## 
## $t_o
## [1] "numeric"
## 
## $weight_kg
## [1] "numeric"
## 
## $weight_g
## [1] "numeric"
## 
## $highTemp
## [1] "logical"
## 
## $timeSeconds
## [1] "numeric"
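
Why the three outputs above look the way they do: apply() first turns the data frame into a matrix, which can only hold one type, so every column reports the same class. sapply() tries to simplify the list from lapply() to a vector, but cannot do so here because the time column has two classes. A sketch on a small invented data frame without a time column:

```r
# Small invented data frame with mixed column types
d <- data.frame(hive = c(4L, 13L), t_o = c(21.6, 22.9), high = c(TRUE, FALSE))

# apply() first converts the data frame to a matrix; a matrix has one type,
# so every column reports the same class
apply(d, 2, class)

# lapply() works column by column and always returns a list
col.classes <- lapply(d, class)

# sapply() simplifies to a vector when every result has length 1
sapply(d, class)

# vapply() additionally checks the type and length of each result
vapply(d, function(col) class(col)[1], character(1))
```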

While

a <- 2
while (a<1000){
  a <- a^2
  print(a)
}
## [1] 4
## [1] 16
## [1] 256
## [1] 65536

While - diagram

Do-while (Finding o)

repeat{
  i <- sample(1:26, 1) # random index; unlike as.integer(runif(1,1,26)) it can also reach 26
  letter <- letters[i]
  print(letter)
  if(letter=='o'){
    break
  }
}
## [1] "o"

bad example for Do-while

a <- 0.5
repeat{
  a <- a - sqrt(a)
  print(a)
  if(sqrt(a) > a){
    break
  }
}
## [1] -0.2071068
## Warning in sqrt(a): NaNs produced
## Error in if (sqrt(a) > a) {: missing value where TRUE/FALSE needed

do-while - diagram

Nice comic

For vs. While

When do you use for? When do you use while? Can you think of examples?

For vs. While

  • if you do not know the number of iterations needed, use while
  • otherwise, always use for!
  • it is easier to accidentally create a while-loop that never stops than a for-loop that never stops.
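
Both cases side by side, as a minimal sketch. The max.steps guard is an extra safety net added for this example; it is one possible way to make sure a while-loop cannot run forever.

```r
# for: the number of iterations is known in advance
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2
}
squares

# while: repeat until a condition is met; max.steps is a hypothetical guard
# against endless loops
value <- 1000
steps <- 0
max.steps <- 100
while (value > 1 && steps < max.steps) {
  value <- value / 2
  steps <- steps + 1
}
steps   # number of halvings needed to get from 1000 to 1 or below
```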

Writing your own functions

A simple function

myPlus <- function(a, b){
  return(a + b)
}
myPlus(1,2)
## [1] 3

Default parameters

myPlus <- function(a=1, b=1){
  return(a + b)
}
myPlus(1,2)
## [1] 3
myPlus()
## [1] 2

Returning multiple values

dataframe.info <- function(df){
  cells.count <- ncol(df)*nrow(df)
  return(list(columns=ncol(df), rows=nrow(df), cells= cells.count))
}
dataframe.info(df)
## $columns
## [1] 13
## 
## $rows
## [1] 171299
## 
## $cells
## [1] 2226887
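
The returned list can be stored and its parts accessed by name. A short sketch that repeats the function and uses mtcars, a data set that ships with R, as a stand-in for the bee data:

```r
dataframe.info <- function(df){
  cells.count <- ncol(df) * nrow(df)
  return(list(columns = ncol(df), rows = nrow(df), cells = cells.count))
}

# mtcars ships with R and stands in for the bee data here
info <- dataframe.info(mtcars)
info$rows          # access by name with $
info[["columns"]]  # or with double brackets
info$cells == info$rows * info$columns
```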

Information for tomorrow

  • Bring some data from your studies or professional background
  • or find some other open data you are interested in

Wednesday

Debugging

Where are the bugs?

funny.words <- function(s = "summer", count = 3){
  s = "summer"
  s1 <- ""
  for(i in 1:2){
    for(i in 1:count){
      print(substr(s, i, i))
      s1 <- paste(s1, substr(s, i, i), sep = "")
    }
  }
  return(s1)
}
funny.words()
## [1] "s"
## [1] "u"
## [1] "m"
## [1] "s"
## [1] "u"
## [1] "m"
## [1] "sumsum"
funny.words("banana", 5)
## [1] "s"
## [1] "u"
## [1] "m"
## [1] "m"
## [1] "e"
## [1] "s"
## [1] "u"
## [1] "m"
## [1] "m"
## [1] "e"
## [1] "summesumme"

Debugging!

Your toolkit

  • If you click on Next, R will execute the next line. (e.g. line 6)
  • If you click on the second button, R will step into the current function call, so it will basically jump into another function. (e.g. into the print function)
  • If you click on the third button, R will execute the rest of the current function or loop. (e.g. line 6 and 7)
  • If you click on “Continue”, R will run until it reaches the next breakpoint. (e.g. in the next round of the loop or in line 10)
  • If you click on “Stop”, R will exit the debug mode.

pair-programming task

Use the debugging options to find all mistakes in the funny.words function.

solution

Plotting

  • many, many libraries
  • we are going to look at base, ggplot and plotly

A basic plot (Standard library)

df.4 <- df[df$hive==4,]
# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3)

Improving (Standard library)

# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3, 
     ylim=c(0,40)) # added this

Improving (Standard library)

# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3, 
     ylim = c(0,40), 
     xlab = "Time (2019)",  # added this
     ylab = "Temperature within hive", # added this
     main = "Sensor measurements") # added this

Improving (Standard library)

# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3, 
     ylim = c(0,40), 
     type = "b", # added this
     lty = 1, # added this
     xlab = "Time (2019)", 
     ylab = "Temperature within hive", 
     main = "Sensor measurements")

Points or Lines? (type)

  • “p”: Points
  • “l”: Lines
  • “b”: Both

Line type (lty)

Improving (Standard library)

df.4 <- df.4[df.4$t_i_3>5&df.4$t_i_3<40,] # added this
# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3, 
     ylim = c(0,40), 
     type = "b",
     lty = 1,
     xlab = "Time (2019)", 
     ylab = "Temperature within hive", 
     main = "Sensor measurements")

Improving (Standard library)

# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3, 
     ylim = c(0,40), 
     type = "b",
     lty = 1,
     pch = 4, # added this
     xlab = "Time (2019)", 
     ylab = "Temperature within hive", 
     main = "Sensor measurements")

Point types (pch)

Improving (Standard library)

# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3, 
     ylim = c(0,40), 
     type = "b",
     lty = 1,
     pch = 4,
     xlim = as.POSIXct(c("2019-08-08", "2019-08-09")), # added this
     xlab = "Time (2019-08-08)", 
     ylab = "Temperature within hive", 
     main = "Sensor measurements")

Improving (Standard library)

# plot temperature inside the hive (sensor 3)
plot(df.4$time, df.4$t_i_3, 
     ylim = c(0,40), 
     type = "b",
     lty = 1,
     pch = 4,
     xlim = as.POSIXct(c("2019-08-08", "2019-08-09")),
     xlab = "Time (2019-08-08)", 
     ylab = "Temperature within hive", 
     main = "Sensor measurements",
     xaxt="n")
axis.POSIXct(1, 
             at=seq(min(df.4$time), max(df.4$time), by="1 hour"), 
             format="%H:00") # added this

Even more complex (Standard library)

# subset data
df.4 <- df[df$hive==4,]
# plot temperature outside
plot(df.4$time, df.4$t_o, ylim=c(0,40),type = 'p', pch=4)
# choose colours
cl <- rainbow(5)
# choose columns
cols <- 4:8
# plot each column
for (i in 1:5){
    points(df.4$time, 
           df.4[,cols[i]],
           col = cl[i],
           pch = 4) # ylim is already set by plot() above
}
# add legend
legend("topright", legend=c(1, 2, 3, 4, 5, "outside"),
       col=c(cl, "black"), pch = 4, lty = 0, cex=0.8)

basic plot (ggplot)

# plot data
library(ggplot2)
ggplot(data = df.4, aes(x=time, y=t_i_3)) + geom_point()

improving (ggplot)

# plot data
library(ggplot2)
ggplot(data = df.4, aes(x=time, y=t_i_3)) + geom_point(shape=4) + 
  ylim(c(0, 40)) + 
  xlab("Time (2019)") + 
  ylab("Temperature within hive") + 
  ggtitle("Sensor measurements")

improving (ggplot)

## Warning: Removed 152 rows containing missing values (geom_point).

more complex (ggplot)

# subset data
df.4 <- df[df$hive==4,]
# choose columns
df.4.cols <- df.4[,c(1,4:9)]
# reshape data
library(reshape)
mdf <- melt(df.4.cols, id=c("time")) 
# plot data
library(ggplot2)
ggplot(data = mdf, aes(x=time, y=value)) + 
  geom_line(aes(colour=variable)) + 
  ylim(c(0, 40))

Basic plot (plotly)

library(plotly)
fig <- plot_ly(df.4[1:100,], x = ~time, y = ~t_i_3)
fig

Saving your plot

png("test.png")
hist(rnorm(100)) # hist() already draws the plot
dev.off()
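
The same three steps (open a file device, draw, close the device) can be wrapped in a function. A minimal sketch; save.histogram and its parameters are made-up names for this example, and the file is written to a temporary directory:

```r
# Hypothetical helper: draws a histogram of `values` and saves it as a png file
save.histogram <- function(filename, values, width = 800, height = 600){
  png(filename, width = width, height = height)  # open the file device
  on.exit(dev.off())                             # always close it again
  hist(values, main = "Histogram", xlab = "value")
  invisible(filename)
}

plot.file <- file.path(tempdir(), "test.png")
save.histogram(plot.file, rnorm(100))
file.exists(plot.file)
```

on.exit() makes sure the device is closed even if the plotting code stops with an error, so no half-open files are left behind.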

Piping

The basic piping operator

library(magrittr)
x <- 9
# Calculate the square-root of x
sqrt(x)
## [1] 3
# Calculate it using pipes
x %>% sqrt
## [1] 3

Updating x

x <- 9
# Calculate the square-root of x and update x
x <- sqrt(x)
x
## [1] 3
# Calculate it using pipes
x <- 9
x %<>% sqrt
x
## [1] 3

real-world example

nrow(subset(df, hive==4))
## [1] 60193
df %>% subset(hive==4) %>% nrow
## [1] 60193

Homework

Let's look at the homework together

Thursday

R-Markdown

R-Markdown is great!

Output formats

  • html
  • pdf
  • word

Header - html

# html
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: html_document
---

Header - Word

# word
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: word_document
---

Options

---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: 
  html_document:
    toc: true # print a table of contents
    toc_depth: 2 # maximal depth of the table of contents
    number_sections: true # automatically number the sections
    code_folding: hide # have buttons to show/ hide code
    theme: united # specify a theme
    toc_float: true # generate a navigation bar

---

Document structure

# Header 1

## Header 2

### Header 3

Code blocks

Use the insert button. You have the following further options:

  • eval: If set to TRUE, the code will be executed
  • echo: If set to TRUE, the code will be printed
  • include: If set to FALSE, the code is still executed, but neither the code nor its output appears in the document

Lists

* unordered list
    + sub-item 1
    + sub-item 2
        - sub-sub-item 1
* item 2

    
1. ordered list
2. item 2
    i) sub-item 1

Lists - how they look

  • unordered list
    • sub-item 1
    • sub-item 2
      • sub-sub-item 1
  • item 2
  1. ordered list
  2. item 2
    1. sub-item 1

Insert an image

![Some description](path to image)

Further documentation

Assignment

  • Create an R-Markdown document
  • Choose a dataset
  • Think about the following question: What do you want to find out? For a large assignment, you should write down a proper motivation for your analysis.
  • Choose 3 of the following tasks for a small assignment and 8 for a large assignment. You can also come up with your own ideas.
  • Write a short text for each task explaining what you have been doing
  • If you are doing the large assignment, make sure to write about the origin of your data, the motivation for the analysis and discuss your results

Ideas

  • Do some exploratory analysis:
    • print the mean/ median/ standard deviation for each column
    • print the total number of missing values
    • print the number of missing values for each column
  • Transform some of the columns/ Create new columns based on existing ones
  • Create a subset of your data
  • Exclude all rows with missing values
  • Create a plot
  • Write a function that takes a filename and some additional parameters. The function should create a plot and save it as “png” with that filename
  • Create a linear model

Enjoy R ;)